Happily, we can use surveys of a sample of our population to learn things about our population
However, our ability to do this is conditional on how good our sample is
What do I mean by “good”?
The 2024 US Presidential Election
Elections are preceded by a flood of surveys
Surveys
Surveys are conducted on a subset (sample) of the population of interest
Our population of interest: individuals who voted in the 2024 US Presidential Election
A good sample
A good sample is a representative one
How closely does our sample reflect our population?
Parallel worlds
Think back to last session on experiments
In an ideal world, we would be able to create two parallel worlds (one with the treatment, one held as our control)
One version of the election booth run without monitors (the control)
One version with monitors (the treatment)
These worlds are perfectly identical to each other prior to treatment
We cannot do this :(
The next best thing
Our next best option is to create two groups that are as identical to one another as possible prior to treatment
If they are (almost) identical, differences between their group-wide outcomes can be attributed to the treatment
One good way of getting two (almost) identical groups is to assign individuals to those groups randomly
Think back to our 1,000 hypothetical people!
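To see why random assignment tends to produce (almost) identical groups, here is a minimal sketch in base R. The 1,000 people, their ages, and their incomes are simulated purely for illustration; the variable names and distributions are assumptions, not data from the course.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

n <- 1000

# Hypothetical pre-treatment characteristics (simulated, not real data)
people <- data.frame(
  id     = seq_len(n),
  age    = round(runif(n, min = 18, max = 90)),
  income = rlnorm(n, meanlog = 10.5, sdlog = 0.6)
)

# Randomly assign exactly half to treatment and half to control
people$group <- sample(rep(c("treatment", "control"), each = n / 2))

# Compare group averages before any treatment is applied
balance <- aggregate(cbind(age, income) ~ group, data = people, FUN = mean)
balance
```

The two rows of `balance` should be close to one another, even though nothing about age or income was used to form the groups.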
Randomization
Randomization continues to pop its chaotic head up
We can use it to create a sample that is (almost) identical to our population, on average
Drawing randomly from our population increases our chances of ending up with a sample that reflects that population
This is referred to as a representative sample
Random sampling
All individuals in the population need to have an equal chance of being selected for the sample
If this holds, you have a pure random sample
This is really hard to do!
How likely are you to pick up a call from a pollster’s unknown number in the middle of the day?
Even if you do pick up, how likely are you to answer all their questions?
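A rough sketch of why this matters, in base R. Everything here is hypothetical: a simulated population in which older people are assumed to be both more supportive of a candidate "A" and more likely to pick up the phone. The numbers are made up to illustrate the mechanism, not to describe real voters.

```r
set.seed(2)  # arbitrary seed so the sketch is reproducible

N <- 100000

# Hypothetical population: support for candidate A is assumed to rise with age
age        <- sample(18:90, N, replace = TRUE)
supports_a <- rbinom(N, size = 1, prob = 0.3 + 0.4 * (age - 18) / 72)

# Step 1: the pollster calls a pure random sample (everyone has an equal chance)
called <- sample(N, size = 5000)

# Step 2: assume older people are more likely to pick up
answer_prob <- (age - 17) / 73
picked_up   <- rbinom(length(called), size = 1, prob = answer_prob[called]) == 1
respondents <- called[picked_up]

estimates <- c(
  population  = mean(supports_a),
  called      = mean(supports_a[called]),
  respondents = mean(supports_a[respondents])
)
round(estimates, 3)
```

Under these assumptions, the people who were called look like the population, but the people who actually answered do not: the chance of ending up in the final sample was no longer equal.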
Large numbers
Randomization isn’t enough: we also need to draw a sufficiently large sample from our population
One person pulled randomly from the class isn’t going to be very reflective of the class!
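As a minimal sketch of the role of sample size, the base R code below draws repeated random samples of different sizes from a simulated population and measures how far the sample average typically lands from the true average. The population and the chosen sample sizes are assumptions made for illustration only.

```r
set.seed(3)  # arbitrary seed so the sketch is reproducible

# Hypothetical skewed population (simulated, not real data)
population <- rlnorm(10000, meanlog = 0, sdlog = 1)

sample_sizes <- c(1, 10, 100, 1000)

# For each size, draw 500 random samples and record the average distance
# between the sample mean and the true population mean
typical_error <- sapply(sample_sizes, function(n) {
  sample_means <- replicate(500, mean(sample(population, size = n)))
  mean(abs(sample_means - mean(population)))
})

data.frame(sample_size = sample_sizes, typical_error = typical_error)
```

Every sample here is drawn randomly, yet the small ones are still often far from the truth. The typical error shrinks as the sample gets larger.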
To illustrate
Countries’ GDP in 2022:
Countries’ GDP
I want to estimate the average GDP across all countries in 2022.
I send out a survey to all countries’ Departments of Statistics and ask for their GDP figures for 2022.
I get 60 responses:
sample_df <- gdp_df |>
  drop_na(sample_value) |>
  sample_n(size = 60) |>
  transmute(country, gdp = sample_value)

sample_df
# A tibble: 60 × 2
   country                  gdp
   <chr>                  <dbl>
 1 Portugal             2.55e11
 2 Bolivia              4.40e10
 3 Peru                 2.46e11
 4 Japan                4.26e12
 5 Kuwait               1.83e11
 6 Viet Nam             4.10e11
 7 Ecuador              1.17e11
 8 Hong Kong SAR, China 3.59e11
 9 Angola               1.04e11
10 Russian Federation   2.27e12
# ℹ 50 more rows
Countries’ GDP
I now calculate the average of these responses, which I find to be: